Learning Spatial-Frequency Transformer for Visual Object Tracking

Abstract

Recently, some researchers have begun to adopt the Transformer to combine with or replace the widely used ResNet as their new backbone network. The Transformer captures long-range relations between pixels well using the self-attention scheme, which complements the issues caused by the limited receptive field of CNNs. Although these trackers work well in regular scenarios, they simply flatten the 2D features into a sequence to better match the Transformer. We believe these operations ignore the spatial prior of the target object, which may lead to sub-optimal results. In addition, many works demonstrate that self-attention is actually a low-pass filter, independent of the input features or keys/queries. That is to say, it may suppress the high-frequency component of the input features and preserve or even amplify the low-frequency information. To handle these issues, in this paper, we propose a unified Spatial-Frequency Transformer that models the Gaussian spatial Prior and High-frequency emphasis Attention (GPHA) simultaneously. To be specific, the Gaussian spatial prior is generated using dual Multi-Layer Perceptrons (MLPs) and injected into the similarity matrix produced by multiplying the Query and Key features in self-attention. The output is fed into a softmax layer and then decomposed into two components, i.e., the direct signal and the high-frequency signal. The low- and high-pass branches are rescaled and combined to achieve all-pass; therefore, the high-frequency features are well protected in stacked self-attention layers. We further integrate the Spatial-Frequency Transformer into the Siamese tracking framework and propose a novel tracking algorithm termed SFTransT. The cross-scale fusion based SwinTransformer is adopted as the backbone, and a multi-head cross-attention module is used to boost the interaction between search and template features. The output is fed into the tracking head for target localization. Extensive experiments on both short-term and long-term tracking benchmarks all demonstrate the effectiveness of our proposed framework. The source code will be released at https://github.com/Tchuanm/SFTransT.git.
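The attention step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation. It assumes that the Gaussian prior is injected additively in log space before the softmax, that the direct (low-pass) component is the uniform average over all tokens, and that the per-query Gaussian centers and widths are supplied by hand, where the abstract says dual MLPs generate the prior. The function names and the parameters `centers`, `sigmas`, `low_scale`, and `high_scale` are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def gaussian_log_prior(height, width, center, sigma):
    """Log of an unnormalized isotropic Gaussian on a height x width grid.

    Values are returned in row-major order so they line up with tokens
    obtained by flattening the 2D feature map.
    """
    cy, cx = center
    return [
        -((y - cy) ** 2 + (x - cx) ** 2) / (2.0 * sigma ** 2)
        for y in range(height)
        for x in range(width)
    ]


def gpha_like_attention(q, k, v, height, width, centers, sigmas,
                        low_scale=1.0, high_scale=1.5):
    """Single-head attention with a spatial prior and frequency rescaling.

    q, k, v are lists of token vectors (one per grid cell, row-major).
    centers[i] and sigmas[i] define the Gaussian prior for query i
    (assumed inputs here; hypothetical stand-ins for learned values).
    """
    n, dim = len(q), len(q[0])
    assert n == height * width, "one token per grid cell is expected"
    uniform = 1.0 / n  # weight of the assumed direct (averaging) component
    outputs = []
    for i in range(n):
        # Scaled dot-product similarity between query i and every key.
        sim = [sum(a * b for a, b in zip(q[i], k[j])) / math.sqrt(dim)
               for j in range(n)]
        # Assumption: the prior is added to the similarity before softmax.
        prior = gaussian_log_prior(height, width, centers[i], sigmas[i])
        attn = softmax([s + p for s, p in zip(sim, prior)])
        # Split into direct part (uniform) and high-frequency residual,
        # rescale each branch, then recombine.
        mixed = [low_scale * uniform + high_scale * (a - uniform)
                 for a in attn]
        outputs.append([sum(mixed[j] * v[j][c] for j in range(n))
                        for c in range(len(v[0]))])
    return outputs
```

With `low_scale = high_scale = 1.0` the sketch reduces to ordinary softmax attention with a spatial bias. Raising `high_scale` above 1 amplifies how far each token's output departs from the token average, which is the sense in which the high-frequency signal is emphasized here.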

Related resources

Learning structured visual dictionary for object tracking

Learning Spatial-Aware Regressions for Visual Tracking

In this paper, we analyze the spatial information of deep features, and propose two complementary regressions for robust visual tracking. First, we propose a kernelized ridge regression model wherein the kernel value is defined as the weighted sum of similarity scores of all pairs of patches between two samples. We show that this model can be formulated as a neural network and thus can be effic...

Visual Learning in Multiple-Object Tracking

BACKGROUND: Tracking moving objects in space is important for the maintenance of spatiotemporal continuity in everyday visual tasks. In the laboratory, this ability is tested using the Multiple Object Tracking (MOT) task, where participants track a subset of moving objects with attention over an extended period of time. The ability to track multiple objects with attention is severely limited. Re...

Convolutional Gating Network for Object Tracking

Object tracking through multiple cameras is a popular research topic in security and surveillance systems, especially when humans are the target. However, occlusion is one of the challenging problems for the tracking process. This paper proposes a multiple-camera-based cooperative tracking method to overcome the occlusion problem. The paper presents a new model for combining convolutiona...

Learning Object Intrinsic Structure for Robust Visual Tracking

In this paper, a novel method to learn the intrinsic object structure for robust visual tracking is proposed. The basic assumption is that the parameterized object state lies on a low dimensional manifold and can be learned from training data. Based on this assumption, firstly we derived the dimensionality reduction and density estimation algorithm for unsupervised learning of object intrinsic ...

Journal

Journal title: IEEE Transactions on Circuits and Systems for Video Technology

Year: 2023

ISSN: 1051-8215, 1558-2205

DOI: https://doi.org/10.1109/tcsvt.2023.3249468